Improving Data Locality by Chunking

Authors

  • Cédric Bastoul
  • Paul Feautrier
Abstract

Cache memories were invented to decouple fast processors from slow memories. However, this decoupling is only partial, and many researchers have attempted to improve cache use through program optimization. The potential benefits are significant, since both energy dissipation and performance depend heavily on the traffic between memory levels. But modeling this traffic is difficult, an observation that has led to the use of heuristic methods for steering program transformations. In this paper, we propose another approach: we simplify the cache model and organize the target program in such a way that an asymptotic evaluation of the memory traffic is possible. Our optimization algorithm uses this information to find the best reordering of the program operations, at least in an asymptotic sense. Our method optimizes both temporal and spatial locality. It can be applied to any static control program with arbitrary dependences. The optimizer has been partially implemented and applied to non-trivial programs. We present experimental evidence that the number of cache misses is drastically reduced, with corresponding performance improvements.
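
The general idea of reordering operations into chunks to cut memory traffic can be illustrated with a toy experiment. The sketch below is not the authors' optimizer or cache model: it assumes a fully associative LRU cache, a program that simply reads an array twice, and a reordering that is legal (no dependence forbids it). All names (`count_misses`, `two_pass_trace`, `chunked_trace`) and parameters are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


def count_misses(trace, cache_lines, line_size):
    """Count misses of a fully associative LRU cache on an address trace.

    Assumed toy model: `cache_lines` lines of `line_size` elements each.
    """
    cache = OrderedDict()  # keys are line numbers, ordered by recency
    misses = 0
    for addr in trace:
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)  # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used
    return misses


def two_pass_trace(n):
    """Original order: read a[0..n-1] fully, then read it fully again."""
    return list(range(n)) + list(range(n))


def chunked_trace(n, chunk):
    """Reordered: both passes are run on one chunk before moving on.

    Assumes the reordering respects the program's dependences.
    """
    trace = []
    for start in range(0, n, chunk):
        block = list(range(start, min(start + chunk, n)))
        trace += block + block
    return trace
```

With 1024 elements, 16 cache lines of 8 elements, and chunks of 64 elements, `count_misses(two_pass_trace(1024), 16, 8)` gives 256 misses, while `count_misses(chunked_trace(1024, 64), 16, 8)` gives 128: once a chunk fits in the cache, the second pass over it causes no further misses in this model.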

Similar resources

Control of loop parallelism in multithreaded code

Due to the large amount of potential parallelism, resource management is a critical issue in multithreaded architectures. The challenge in code generation is to control the parallelism without reducing the machine's ability to exploit it. Controlled parallelism reduces idle time, communication, and delay caused by synchronization. At the same time it increases the potential for exploitation of p...

Hierarchical Chunking in Classifier Systems

Two standard schemes for learning in classifier systems have been proposed in the literature: the bucket brigade algorithm (BBA) and the profit sharing plan (PSP). The BBA is a local learning scheme which requires less memory and lower peak computation than the PSP, whereas the PSP is a global learning scheme which typically achieves a clearly better performance than the BBA. This “requirement ...

Gemini: A Computation-Centric Distributed Graph Processing System

Traditionally distributed graph processing systems have largely focused on scalability through the optimizations of inter-node communication and load balance. However, they often deliver unsatisfactory overall processing efficiency compared with shared-memory graph computing frameworks. We analyze the behavior of several graph-parallel systems and find that the added overhead for achieving scal...

Resource Management in Dataflow-Based Multithreaded Execution

Due to the large amount of potential parallelism, resource management is a critical issue in multithreaded execution. The challenge in code generation is to control the parallelism without reducing the machine's ability to exploit it. Controlled parallelism reduces idle time, communication, and delay caused by synchronization. At the same time it increases the potential for exploitation of prog...

Ddelta: A deduplication-inspired fast delta compression approach

Delta compression is an efficient data reduction approach to removing redundancy among similar data chunks and files in storage systems. One of the main challenges facing delta compression is its low encoding speed, a worsening problem in face of the steadily increasing storage and network bandwidth and speed. In this paper, we present Ddelta, a deduplication-inspired fast delta compression sch...


Publication date: 2003